Types of resampling techniques and considerations for choosing one
When I was trying out PLS regression on the gasoline data (a small dataset of 60 samples with NIR spectra measured over more than 400 wavelengths), I got a little stuck at the resampling step. Most examples online used k-fold cross-validation as the resampling method, but I also wondered about bootstrapping, because I remembered my Prof mentioning that it could also help in assessing model performance and preventing overfitting.
I think there was a need for me to get my concepts right.
I got the information below from the book Applied Predictive Modeling (http://appliedpredictivemodeling.com/), Chapter 4.
In k-fold cross-validation, samples are randomly divided into k subsets of roughly equal size. A model is fit using all the samples except the first subset, and the held-out first subset is then used to assess model performance. The whole process is repeated k times, each time holding out a different subset. Model performance may then be summarized across the k held-out sets using error rates or R² values.
k-fold cross-validation usually has high variance (low bias) compared to other methods. However, for large training sets, the variance-bias differences should be negligible.
k is usually fixed at 5 or 10; the bias is smaller for k = 10.
This method is recommended for tuning model parameters when sample sizes are small: its bias-variance properties are good as an indicator of performance, and it does not come with high computational costs (unlike the leave-one-out method).
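The blog's worked examples are in R, but the splitting logic behind k-fold cross-validation is simple enough to sketch on its own. Below is a minimal, dependency-free Python illustration (the function names `kfold_indices` and `kfold_splits` are my own, not from any library): shuffle the indices once, partition them into k folds, and hold each fold out in turn.

```python
import random

def kfold_indices(n_samples, k, seed=42):
    """Randomly partition sample indices into k folds of nearly equal size."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices across k folds; fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

def kfold_splits(n_samples, k, seed=42):
    """Yield (train, test) index lists: each fold is held out exactly once."""
    folds = kfold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Example: 60 samples (like the gasoline data) with k = 10
for train, test in kfold_splits(60, 10):
    assert len(test) == 6               # 60 / 10 samples held out each time
    assert len(train) == 54             # the rest are used for fitting
    assert not set(train) & set(test)   # train and test never overlap
```

Each sample appears in exactly one held-out set, so averaging the 10 held-out error rates uses every observation once for assessment.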
A bootstrap sample is a random sample of the data taken with replacement, of the same size as the original dataset. Some samples may be selected multiple times, while others may not be selected at all (on average, about 63.2% of the distinct samples end up in any given bootstrap sample). In general, bootstrap error rates tend to have less uncertainty than k-fold cross-validation.
This method may be preferred if the aim is to choose between models, as the bootstrapping technique has low variance (high bias).
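The bootstrap draw itself can be sketched in a few lines of dependency-free Python (the function name `bootstrap_sample` is my own): draw n indices with replacement, and keep the never-selected "out-of-bag" indices for assessing performance.

```python
import random

def bootstrap_sample(n_samples, seed=0):
    """Draw n_samples indices with replacement (one bootstrap sample).

    Returns (in_bag, out_of_bag): the indices selected for model fitting,
    and the never-selected indices available for assessing performance.
    """
    rng = random.Random(seed)
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag

# Example: 60 samples, as in the gasoline data
in_bag, oob = bootstrap_sample(60, seed=1)
assert len(in_bag) == 60            # same size as the original dataset
assert set(in_bag).isdisjoint(oob)  # out-of-bag samples were never drawn
# On average ~63.2% of distinct samples are in-bag, so oob holds roughly
# a third of the data; duplicates in in_bag make up the difference.
```

Repeating this for many seeds and averaging the out-of-bag error rates gives the low-variance bootstrap performance estimate described above.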
There is a spectrum of interpretability and flexibility across models. Choose various models that sit at different parts of the spectrum: for example, a linear model, which is simple and inflexible but easy to interpret; partial least squares, which is lower in interpretability but higher in flexibility; and random forests, which are hard to interpret but very flexible.
I am really glad to have this Applied Predictive Modeling book with me. It clarified a lot of the concepts, but most of the code is given in older coding styles. Looking forward to trying the worked examples using the tidymodels framework!
Kuhn, M. and Johnson, K., Applied Predictive Modeling, Chapter 4. http://appliedpredictivemodeling.com/
For attribution, please cite this work as
lruolin (2021, March 11). pRactice corner: Notes - Resampling. Retrieved from https://lruolin.github.io/myBlog/posts/20210311_notes on resampling/
BibTeX citation
@misc{lruolin2021notes,
  author = {lruolin},
  title = {pRactice corner: Notes - Resampling},
  url = {https://lruolin.github.io/myBlog/posts/20210311_notes on resampling/},
  year = {2021}
}